1 Introduction


Majorization algorithms are popular these days. The basic idea is simple. To minimize
a real-valued target function $f : S \Rightarrow \mathbb{R}$ on a set $S \subseteq \mathbb{R}^n$ we use an iterative algorithm in
which iteration $k + 1$ updates $x^{(k)} \in S$ to $x^{(k+1)} \in S$ in two substeps. In the first substep we
find a function $g$ that lies above the target function on $S$ and touches it at the current $x^{(k)}$.
In the second substep we find $x^{(k+1)}$ by minimizing the majorizing function $g$ over $S$. This
produces a strictly decreasing sequence of target function values $f^{(k)} = f(x^{(k)})$, which forces
convergence under some natural additional conditions (D'Esopo (1959), Zangwill (1969)).

Early majorization algorithms for specific classes of problems were described by Dempster, 
Laird, and Rubin (1977) and De Leeuw (1977). Both papers suggest that a general class of 
algorithms lies behind their proposals. As a natural next step, some more general families of 
majorization methods were discussed in Voss and Eckhardt (1980) and Böhning and Lindsay
(1988). A general theory, inspired by both Dempster, Laird, and Rubin (1977) and De Leeuw 
(1977), was introduced in De Leeuw (1994) and Heiser (1995), and a much improved and 
expanded version is now available in book form in Lange (2016) and De Leeuw (2016). 

Minimizing $f$ is done by constructing majorizations $g$. But of course we can also maximize $f$
by using minorizers $g$. Thus Lange and co-workers (for example Hunter and Lange (2004))
defined the class of MM algorithms, which cleverly covers both majorization-minimization
and minorization-maximization. In this paper we only talk about majorization-minimization,
because it is trivial to switch from one to the other anyway (by using $-f$ and $-g$).

Now for notation. 

• The real numbers are $\mathbb{R}$, the positive reals are $\mathbb{R}_+$, and the vector space of $n$-tuples of
real numbers is $\mathbb{R}^n$. The extended reals (with $\pm\infty$) are $\overline{\mathbb{R}}$.

• I have already used the notation $f : X \Rightarrow Y$ for a function from $X$ to $Y$. If $f : X \otimes Y \Rightarrow Z$
then $f(\bullet, y) : X \Rightarrow Z$ for each $y \in Y$. Thus $f(\bullet, y)(x) = f(x, y)$. If $f(x) = \|x\|$, for
example, I also use the notation $\|\bullet\|$ for the function $f$. Throughout I try to distinguish
between the function and the values it takes, so I avoid saying “the function $f(x) = ax + b$”.

• Successive derivatives of $f : X \Rightarrow Y$ are $\mathcal{D}f$, $\mathcal{D}^2 f$, and so on. If the domain $X$ is a
subset of the real line $\mathbb{R}$ we also use $f'$, $f''$, $f'''$, $f^{iv}$, and so on. If $g : X \otimes Y \Rightarrow Z$ we use
$\mathcal{D}_1 g$, $\mathcal{D}_2 g$, $\mathcal{D}_{11} g = \mathcal{D}_1 \mathcal{D}_1 g$, and so on. See Spivak (1965), p. 44-45.

• The symbol $\stackrel{\Delta}{=}$ is used for definitions.

• End of proof is ■.

2 Majorization Basics 

Definition 1: Suppose $f : S \Rightarrow \mathbb{R}$ and $g : S \Rightarrow \mathbb{R}$. Then we say $g$ majorizes $f$ on $S$ if
$g(x) \geq f(x)$ for all $x \in S$. If $S$ is all of $\mathbb{R}^n$ we usually leave out the “on $S$”.

Example 1: If $f : \mathbb{R} \Rightarrow \mathbb{R}$ is a non-trivial cubic and $g : \mathbb{R} \Rightarrow \mathbb{R}$ is a quadratic, then $g$ does
not majorize $f$ on $\mathbb{R}$ and $f$ does not majorize $g$ on $\mathbb{R}$. Majorization of $f$ by $g$ would imply
$g(x) - f(x) \geq 0$ for all $x \in \mathbb{R}$, and $g - f$ is a non-trivial cubic, which cannot be non-negative
on the whole line. The same argument rules out majorization of $g$ by $f$.

Definition 2: If $g$ majorizes $f$ on $S$ then the support set of the majorization is the set
of all $x \in S$ with $g(x) = f(x)$. Thus for $x \in S$ not in the support set we have $g(x) > f(x)$.
Elements of the support set are support points.

Note 1: We usually abbreviate “$g$ majorizes $f$ on $S$ with $y \in S$ a support point of the
majorization” to “$g$ majorizes $f$ on $S$ at $y$”.

Note 2: The set $\mathcal{G}$ of all functions majorizing $f$ on $S$ is convex. If we order $\mathcal{G}$ using
majorization and define functions $g \wedge h$ and $g \vee h$ by $(g \wedge h)(x) = \min(g(x), h(x))$ and
$(g \vee h)(x) = \max(g(x), h(x))$ then $\mathcal{G}$ becomes an inf-complete lattice. Both the convexity
and the lattice property remain true for the set $\mathcal{G}_Y$ of all functions majorizing $f$ on $S$ with a
given support set $Y$. Both $\mathcal{G}$ and $\mathcal{G}_Y$ have as their unique minimum element the function $f$.

Example 2: There can be zero, one, a finite number, or an infinite number of support
points of a majorization. The example in figure 1, from De Leeuw and Lange (2009), has
$g(x) = x^2$ and $f(x) = x^2 - 10\sin^2(x)$. The support set consists of all integer multiples of $\pi$.

Figure 1: Support Set

Definition 3: A strict majorization on S at y is a majorization with a unique support 
point. 

Note 3: If $f$ is concave, then $g$ with $g(x) = f(y) + z'(x - y)$ majorizes $f$ at $y$, for any $z$ in
the superdifferential $\partial f(y)$. If $f$ is strictly concave the majorization is strict.

Note 4: If $f$ is two times differentiable, and there is a $B$ such that $B - \mathcal{D}^2 f(x)$ is positive
semi-definite for all $x$ then $g$ with $g(x) = f(y) + (x - y)'\mathcal{D}f(y) + \frac{1}{2}(x - y)'B(x - y)$ majorizes
$f$ at $y$.

Note 5: If a differentiable $g$ majorizes a differentiable $f$ on $\mathbb{R}^n$ then $\mathcal{D}f(y) = \mathcal{D}g(y)$ at
any support point. If a twice differentiable $g$ majorizes a twice differentiable $f$ on $\mathbb{R}^n$ then
$\mathcal{D}^2 g(y) \succeq \mathcal{D}^2 f(y)$ at any support point, i.e. $\mathcal{D}^2 g(y) - \mathcal{D}^2 f(y)$ is positive semi-definite. This
is because the differentiable function $h = g - f$ has a minimum, equal to zero, at any support
point. If majorization is strict we have $\mathcal{D}^2 g(y) \succ \mathcal{D}^2 f(y)$.


3 Majorization Algorithm 

Definition 4: A majorization algorithm is an iterative algorithm intended to minimize
$f$ over $x \in S$. Iteration $k$ starts with a current $x^{(k)} \in S$. Select a $g$ that majorizes $f$ on $S$ at
$x^{(k)}$ and find an $x^{(k+1)} \in S$ such that $g(x^{(k+1)}) < g(x^{(k)})$. If there is no $x \in S$ with $g(x) < g(x^{(k)})$
the algorithm stops.

The key to why majorization algorithms work (i.e. converge) is the following result.

Theorem 1: Suppose $g$ majorizes $f$ on $S$ at $y$. Then, for all $x \in S$, $g(x) < g(y)$ implies
$f(x) < f(y)$.

Proof: $f(x) \leq g(x)$ by majorization, $g(x) < g(y)$ by assumption, and $g(y) = f(y)$ by support.
Thus we have the sandwich inequality $f(x) \leq g(x) < g(y) = f(y)$. If majorization is strict
this becomes $f(x) < g(x) < g(y) = f(y)$. But even if $x$ is a second support point of the
majorization we still have $f(x) = g(x) < g(y) = f(y)$. ■

Note 6: Definition 4 does not tell us how to select majorizations. In that sense it is an 
incomplete definition, which makes it impossible to study the properties of the algorithm. To 
actually get an implementation going, we need a more complete definition. 

Definition 5: A majorization scheme for $f : S \Rightarrow \mathbb{R}$ on $S$ is a function $g : S \otimes S \Rightarrow \mathbb{R}$
such that

• $g(x, y) \geq f(x)$ for all $x, y \in S$,

• $g(x, x) = f(x)$ for all $x \in S$.

In other words, for each $y \in S$ the function $g(\bullet, y)$ majorizes $f$ on $S$ at $y$.

Definition 6: [Redone] Suppose $g$ is a majorization scheme for $f$ on $S$. In a majorization
algorithm iteration $k$ starts with a current $x^{(k)} \in S$. We then choose $x^{(k+1)} \in S$ such that
$g(x^{(k+1)}, x^{(k)}) < g(x^{(k)}, x^{(k)})$. The sandwich inequality becomes

$$f(x^{(k+1)}) \leq g(x^{(k+1)}, x^{(k)}) < g(x^{(k)}, x^{(k)}) = f(x^{(k)}).$$

If there is no $x \in S$ with $g(x, x^{(k)}) < g(x^{(k)}, x^{(k)})$ the algorithm stops.
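Definition 6 can be sketched in a few lines of R. The example below is only an illustration under stated assumptions: the target $f(x) = \sqrt{1 + x^2}$, the bound $B = 1$ on $f''$, and the function name `majorize` are not from the text. The scheme is the quadratic one of Note 4, and the sandwich inequality is checked at every iteration.

```r
# Illustrative target (an assumption, not from the text): f(x) = sqrt(1 + x^2).
# Its second derivative (1 + x^2)^(-3/2) is at most 1, so B = 1 can be used in Note 4.
f  <- function(x) sqrt(1 + x^2)
df <- function(x) x / sqrt(1 + x^2)
B  <- 1

# Majorization scheme g(x, y) = f(y) + f'(y)(x - y) + (B / 2)(x - y)^2
g <- function(x, y) f(y) + df(y) * (x - y) + 0.5 * B * (x - y)^2

majorize <- function(x0, itmax = 50, eps = 1e-10) {
  x <- x0
  values <- f(x)
  for (k in seq_len(itmax)) {
    xnew <- x - df(x) / B   # minimizer of g(., x) over the real line
    # sandwich inequality: f(xnew) <= g(xnew, x) <= g(x, x) = f(x)
    stopifnot(f(xnew) <= g(xnew, x) + 1e-12,
              g(xnew, x) <= g(x, x) + 1e-12)
    values <- c(values, f(xnew))
    if (abs(xnew - x) < eps) break
    x <- xnew
  }
  list(x = xnew, values = values)
}

res <- majorize(3)
print(res$x)
print(res$values)
```

The recorded function values should never increase, which is the content of Theorem 1 applied at each step.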

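Note 8: Not every majorization scheme leads to a useful majorization algorithm. The
function $g$ with $g(x, y) = f(y) + \alpha|x - y|$ is a majorization scheme for $f$ for any $\alpha > 0$ that is
at least as large as a Lipschitz constant of $f$. But because $g(x, y) \geq f(y) = g(y, y)$ for all $x$ it
is impossible to choose $x^{(k+1)}$ such that $g(x^{(k+1)}, x^{(k)}) < g(x^{(k)}, x^{(k)})$, so the algorithm stops
immediately at any initial solution $x^{(0)}$.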
4 Fans of Functions


Definition 7: A fan on $S$ at $y \in S$ is a function $g : S \otimes A \Rightarrow \mathbb{R}$, with $A$ a real interval,
such that

• $g(y, \bullet)$ is a constant function,

• If $\alpha < \beta$ then $g(x, \alpha) < g(x, \beta)$ for all $x \neq y$.

Thus for $\alpha < \beta$ we have $g(\bullet, \beta)$ strictly majorizing $g(\bullet, \alpha)$ at $y$.

Note 7: Suppose

$$g(x, 0) = \inf_{\alpha \in A} g(x, \alpha) > -\infty.$$

Then $g(\bullet, \alpha)$ strictly majorizes $g(\bullet, 0)$ for every $\alpha \in A$.

Note 9: Suppose $g(\bullet, \alpha)$ is differentiable at $y$ for all $\alpha$. Then $g(\bullet, \beta) - g(\bullet, \alpha)$ has either a
minimum or a maximum at $y$, and thus $\mathcal{D}_1 g(y, \alpha) = \mathcal{D}_1 g(y, \beta)$. Thus all $g(\bullet, \alpha)$ have the same
tangent at $y$. If the $g(\bullet, \alpha)$ are twice differentiable and $\alpha < \beta$ then $\mathcal{D}_{11} g(y, \alpha) \preceq \mathcal{D}_{11} g(y, \beta)$
in the Loewner order.

Example 4: Figure 2 is an example of a fan that is quadratic in $x$ at $y = 3$ and linear in
$\alpha$, with

$$g(x, \alpha) = 1 + 2(x - 3) + \frac{1}{2}\alpha(x - 3)^2.$$

Figure 2 plots the quadratic functions $g(\bullet, \alpha)$ for $\alpha = 1, \cdots, 10$. They have their minimum at
$3 - 2/\alpha$, with minimum value $1 - 2/\alpha$. The common tangent is the blue line $g(\bullet, 0) = 1 + 2(x - 3)$.

Figure 2: Fan, Quadratic in x

The linear functions $g(x, \bullet)$ are plotted in figure 3, for $x$ taking 50 values between $-10$ and
$+10$. The blue line is the function $\min_x g(x, \alpha) = 1 - 2/\alpha$.
Figure 3: Fan, Linear in alpha
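The claims of Example 4 are easy to check numerically. The sketch below is not the code used for the figures; the names `common`, `increasing`, and `mins` are illustrative. It evaluates the fan on the same 50-point grid, confirms the common value at $y = 3$ and the strict ordering in $\alpha$ away from $y$, and compares the minimizers found by `optimize` with $3 - 2/\alpha$.

```r
# Fan of Example 4: quadratic in x, linear in alpha, common point y = 3
g <- function(x, alpha, y = 3) 1 + 2 * (x - y) + 0.5 * alpha * (x - y)^2

alphas <- 1:10
x <- seq(-10, 10, length.out = 50)   # this grid does not contain x = 3

# g(3, alpha) is the same for every alpha
common <- sapply(alphas, function(a) g(3, a))

# for x != 3 the values increase strictly with alpha
vals <- outer(x, alphas, g)
increasing <- all(apply(vals, 1, function(row) all(diff(row) > 0)))

# numerical minimizers versus the closed form 3 - 2 / alpha
mins <- sapply(alphas, function(a) optimize(g, c(-10, 10), alpha = a)$minimum)
print(cbind(alpha = alphas, numeric = mins, closed_form = 3 - 2 / alphas))
```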

Example 5: This is a minor modification of example 4. We use

$$g(x, \alpha) = 1 + 2(x - 3) + \frac{1}{2}\log(\alpha)(x - 3)^2.$$

This fan is still quadratic in $x$, but for $0 < \alpha < 1$ the quadratics are concave. Also for this
example there is no $g(x, 0)$, because $\inf_{\alpha > 0} g(x, \alpha) = -\infty$ for all $x \neq 3$.

Figure 4: Fan, Logarithmic in alpha
Definition 8: If $g : S \otimes A \Rightarrow \mathbb{R}$ then $y \in S$ is a common point of $g$ if $g(y, \alpha) = g(y, \beta)$
for all $\alpha$ and $\beta$ in $A$, i.e. if $g(y, \bullet)$ is a constant function on $A$. Thus if $g$ is a fan on $S$ at $y$,
then $y$ is a common point of the fan.

Theorem 5: A fan cannot have more than one common point.

Proof: Suppose $g$ is a fan on $S$ at $y$ and $z \in S$ with $z \neq y$ is a second common point. Then
$g(z, \alpha) = g(z, \beta)$ for all $\alpha$ and $\beta$, contradicting that for $\alpha \neq \beta$ either $g(x, \alpha) < g(x, \beta)$ for all
$x \neq y$ or $g(x, \beta) < g(x, \alpha)$ for all $x \neq y$. ■

Example 6: Suppose the function $g : \mathbb{R} \otimes A \Rightarrow \mathbb{R}$ consists of the quartics $g(\bullet, \alpha)$ with
$g(x, \alpha) = \alpha(x - 3)^2(x + 1)^2$. Then $g$ has the two common points $3$ and $-1$ and thus $g$ is not a fan.

Figure 5: This is not a Fan

Note 10: The fan in figures 2 and 3, like most of the fans considered in this paper, actually 
has additional structure, captured in the following definition. 

Definition 10: An additive fan on $S$ at $y$ is a fan of the form $g(x, \alpha) = p(x) + \alpha q(x)$
with $q(x) > 0$ for all $x \neq y$.

Note 11: In an additive fan at $y$, since $g(y, \alpha)$ is the same for all $\alpha \in A$, it follows that
$q(y) = 0$ and $g(y, \alpha) = p(y)$. Suppose the additive fan is differentiable. Then, in the same
way, $\mathcal{D}_1 g(y, \alpha)$ does not depend on $\alpha$, and thus $\mathcal{D}q(y) = 0$ and $\mathcal{D}_1 g(y, \alpha) = \mathcal{D}p(y)$. Since $q$
has a strict local minimum at $y$ we also have $\mathcal{D}^2 q(y) \succeq 0$.


5 Majorizing Fans 

Definition 11: A majorizing fan on $S$ at $y$ is a fan $g : S \otimes A \Rightarrow \mathbb{R}$ on $S$ such that

• For all $\alpha \in A$ the function $g(\bullet, \alpha)$ majorizes $f$ on $S$ at $y$.

Thus the common point of the fan is the support point of all majorizations in the fan.

Note 12: Our key result for majorizing fans is a generalization of a result first proved
by Van Ruitenburg (2005). Our version of Van Ruitenburg's result does not suppose any
particular functional form for the majorizations in the fan, while Van Ruitenburg (2005) uses
a restricted class of polynomials. Our general definition is not even restricted to univariate
functions.

Theorem 2: Suppose $g$ is a majorizing fan for $f$ on $S$ with common support point $y$. There
can be at most one $g(\bullet, \alpha)$ with more than one support point. This $g(\bullet, \alpha)$, if it exists, is the
minimum element of the fan.

Proof: From the following two lemmas. ■ 

Lemma 1: Suppose $g$ is a fan majorizing $f$ on $S$ at $y$. If $g(\bullet, \alpha)$ has a second support point
$u \neq y$ and $g(\bullet, \beta)$ has a second support point $v \neq y$, then $\alpha = \beta$.

Proof: Suppose $g(\bullet, \alpha)$ strictly majorizes $g(\bullet, \beta)$. Then $g(u, \beta) < g(u, \alpha) = f(u)$ and thus
$g(\bullet, \beta)$ does not majorize $f$. If $g(\bullet, \beta)$ strictly majorizes $g(\bullet, \alpha)$ then $g(v, \alpha) < g(v, \beta) = f(v)$
and thus $g(\bullet, \alpha)$ does not majorize $f$. The contradictions prove $g(\bullet, \alpha) = g(\bullet, \beta)$. ■

Lemma 2: Suppose $g$ is a fan majorizing $f$ on $S$ at $y$. If $g(\bullet, \alpha)$ has a second support point
$z \neq y$ then $g(\bullet, \alpha)$ is majorized strictly by all $g(\bullet, \beta)$ with $\beta \neq \alpha$, and is consequently the
minimum element of the fan.

Proof: Suppose $g(\bullet, \alpha)$ strictly majorizes $g(\bullet, \beta)$. Then $g(z, \beta) < g(z, \alpha) = f(z)$ and thus
$g(\bullet, \beta)$ does not majorize $f$. Consequently $g(\bullet, \beta)$ strictly majorizes $g(\bullet, \alpha)$. ■

Note 13: Note that we have not shown that in a majorizing fan a majorization with more 
than one support point always exists. And neither have we shown that having two or more 
support points is necessary for minimality in the fan. 

Example 10: The functions $g(\bullet, \alpha)$ with $g(x, \alpha) = \alpha x^2$ and $\alpha \geq 0$ are a majorizing fan for $f$ with
$f(x) = -x^2$ at $y = 0$. For each $\alpha$ there is only a single support point. $g(\bullet, 0) = 0$ is the
minimal element.

Example 7: Suppose $g$ has $g(x, \alpha) = f(x) + \alpha\|x - y\|$ for any norm $\|\bullet\|$. For $\alpha \geq 0$ this
is a majorizing fan of $f$ with common support point $y$ and minimum element $f$. If $g(\bullet, \alpha)$
has a second support point $z \neq y$ then we must have $g(z, \alpha) = f(z) + \alpha\|z - y\| = f(z)$ and
thus $\alpha = 0$, and $g(\bullet, \alpha) = f$.

Example 8: Suppose $g$ with

$$g(x, \alpha) = f(y) + f'(y)(x - y) + \frac{1}{2}\alpha(x - y)^2$$

is a majorizing fan for the differentiable $f$ at $y$. If for some $\alpha$ the majorization has a second
support point then it is the minimum element of the majorizing fan.



Note 13: If $g$ with $g(x, \alpha) = p(x) + \alpha q(x)$ is an additive majorizing fan for $f$ on $S$ at $y$
then $p(y) = f(y)$ and $\mathcal{D}p(y) = \mathcal{D}f(y)$. Since $\mathcal{D}_{11} g(y, \alpha) \succeq \mathcal{D}^2 f(y)$ we have

$$\alpha \mathcal{D}^2 q(y) \succeq \mathcal{D}^2 f(y) - \mathcal{D}^2 p(y),$$

which is particularly interesting in the one-dimensional case, where for $q''(y) > 0$ it becomes

$$\alpha \geq \frac{f''(y) - p''(y)}{q''(y)}.$$

In the quadratic fan of example 8 this simply becomes $\alpha \geq f''(y)$.


6 Checking Majorization 

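Theorem 3: Suppose $g$ is a fan on $S$ at $y$. Define $\delta : A \Rightarrow \overline{\mathbb{R}}$ by

$$\delta(\alpha) = \inf_{x \in S} \{g(x, \alpha) - f(x)\}.$$

• If $g$ is a majorizing fan for $f$ on $S$ at $y$ then $\inf_{\alpha \in A} \delta(\alpha) = 0$.

• If $f$ is continuous and $g$ is jointly continuous on $S \otimes A$ then $\inf_{\alpha \in A} \delta(\alpha) = 0$ is also
sufficient for $g$ to be a majorizing fan.

Proof: For a majorizing fan the minimum of $g(x, \alpha) - f(x)$ over $S$ is attained at $y$, and
$\delta(\alpha) = 0$ for all $\alpha \in A$. For any fan with $g(y, \alpha) - f(y) = 0$ we have $\delta(\alpha) \leq 0$. If $\inf_{\alpha \in A} \delta(\alpha) < 0$
there is at least one $\alpha \in A$ and one $x \in S$ such that $g(x, \alpha) - f(x) < 0$, and thus $g$ is not a
majorizing fan. ■

Note 14: $\delta$ may take the value $-\infty$. But $\delta$ is finite-valued if $S$ is compact and both $f$ and
the majorizers $g(\bullet, \alpha)$ are continuous, or if we can assume that $g(x, \alpha) - f(x)$ attains its minimum in
$S$ for all $\alpha \in A$.

Note 17: Because $g(x, \alpha) < g(x, \beta)$ if $\alpha < \beta$ the function $\delta$ is increasing, and because of
continuity $\inf_{\alpha \in A} \delta(\alpha) = \lim_{\alpha \downarrow \underline{\alpha}} \delta(\alpha) = \delta(\underline{\alpha})$, where $\underline{\alpha} = \inf A$.

Note 16: For any fan, if

$$\hat{A} = \{\alpha \mid \delta(\alpha) \geq 0\}$$

then $\hat{A} = [\hat{\alpha}, +\infty)$.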

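Theorem 4: Suppose $g$ with $g(x, \alpha) = p(x) + \alpha q(x)$ is an additive fan on $S$ at $y$ with $p(y) = f(y)$. Define

$$\hat{\alpha} = \sup_{x \in S \setminus y} \frac{f(x) - p(x)}{q(x)}.$$

If $\hat{\alpha} < +\infty$ then $g(\bullet, \alpha)$ majorizes $f$ on $S$ at $y$ if and only if $\alpha \geq \hat{\alpha}$.

Proof: We have majorization if $p(x) + \alpha q(x) - f(x) \geq 0$ or

$$\alpha \geq \frac{f(x) - p(x)}{q(x)}$$

for all $x \in S$ with $x \neq y$, which is equivalent to the condition in the theorem. ■

Two points: at a second support point $z$ of a differentiable majorization we have

$$g(z, \alpha) - f(z) = 0,$$
$$\mathcal{D}_1 g(z, \alpha) = f'(z).$$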

7 Additive Taylor Fans 

7.1 Even-order Taylor Polynomials 

For $r$ odd, i.e. $r + 1$ even, we can use the additive fan

$$g(x, \alpha) = \sum_{s=0}^{r} \frac{1}{s!} f^{(s)}(y)(x - y)^s + \frac{1}{(r + 1)!}\alpha(x - y)^{r+1}.$$

Van Ruitenburg (2005) considers the additive fan

$$g(x, \alpha) = f(y) + f'(y)(x - y) + \frac{1}{(r + 1)!}\alpha(x - y)^{r+1}.$$

I see no reason to drop the terms of degree 2 to $r$, although of course for these intermediate
terms we are not forced to use the $f^{(s)}(y)$ as coefficients. In fact, including degree 2 and
higher allows us to design majorization algorithms with faster convergence.


7.2 Odd-order Taylor Polynomial

For $r$ even, i.e. $r + 1$ odd, the corresponding additive fan uses the absolute value in the last term:

$$g(x, \alpha) = \sum_{s=0}^{r} \frac{1}{s!} f^{(s)}(y)(x - y)^s + \frac{1}{(r + 1)!}\alpha|x - y|^{r+1}.$$

8 Univariate Examples 

8.1 A Quartic with a Quadratic Fan 

Suppose $f$ is the quartic

$$f(x) = f(y) + f'(y)(x - y) + \frac{1}{2}f''(y)(x - y)^2 + \frac{1}{6}f'''(y)(x - y)^3 + \frac{1}{24}f^{iv}(x - y)^4$$

and we majorize $f$ at $y$ with the additive quadratic fan

$$g(x, \alpha) = f(y) + f'(y)(x - y) + \frac{1}{2}\alpha(x - y)^2.$$

Then

$$\hat{\alpha} = \sup_x \frac{f(x) - f(y) - f'(y)(x - y)}{\frac{1}{2}(x - y)^2} = \sup_x \left\{ f''(y) + \frac{1}{3}f'''(y)(x - y) + \frac{1}{12}f^{iv}(x - y)^2 \right\}.$$

If $f^{iv} > 0$ we have $\hat{\alpha} = +\infty$, and the fan does not majorize $f$ at $y$. If $f^{iv} < 0$ the maximum
is attained at

$$x = y - \frac{2f'''(y)}{f^{iv}}$$

and is equal to

$$\hat{\alpha} = f''(y) - \frac{(f'''(y))^2}{3f^{iv}}.$$

There is no guarantee that $\hat{\alpha} > 0$, so the sharp majorizing quadratic may be concave,
in which case it does not have a minimum. The algorithm indicates, appropriately, that
$\inf_x f(x) = -\infty$. The convergence rate of the majorization algorithm at a local minimum $x$ is

$$1 - \frac{f''(x)}{\hat{\alpha}} = \frac{(f'''(x))^2}{(f'''(x))^2 - 3f^{iv}f''(x)}.$$

For the quartic with $f(x) = x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4$ we show the sharp quadratic majorizations at
$y = -1$, $y = -.5$, and $y = 2$. For $y = -1$ the second support point is at 1.6666666667, for
$y = -.5$ it is at 1.1666666667, and for $y = 2$ it is at -1.3333333333. The convergence rate at
the local minimum zero is 0.1, which is plenty fast. Figure xxx also illustrates things that
can go wrong. If we start the algorithm at $1\frac{2}{3}$, then the successor is $-1$, and the algorithm
stops at that local maximum. If we start the algorithm at 2 the majorizer is concave and
does not have a minimum.
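The update function `succ` used in the loop that follows is not listed in the text. A minimal reconstruction, assuming the quartic is $f(x) = x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4$ and that one step minimizes the sharp quadratic majorizer, is sketched below. The helper names `d1`, `d2`, `d3`, `d4`, and `alpha_hat` are illustrative and not the author's.

```r
# Derivatives of the assumed quartic f(x) = x^2 + x^3 / 3 - x^4 / 4
d1 <- function(y) 2 * y + y^2 - y^3     # f'(y)
d2 <- function(y) 2 + 2 * y - 3 * y^2   # f''(y)
d3 <- function(y) 2 - 6 * y             # f'''(y)
d4 <- -6                                # f^{iv}, constant and negative

# sharpest quadratic fan parameter: f''(y) - f'''(y)^2 / (3 f^{iv})
alpha_hat <- function(y) d2(y) - d3(y)^2 / (3 * d4)

# one majorization step: minimize f(y) + f'(y)(x - y) + alpha_hat (x - y)^2 / 2
succ <- function(y) {
  a <- alpha_hat(y)
  if (a <= 0) stop("sharp quadratic majorizer is not convex, no minimum exists")
  y - d1(y) / a
}

print(succ(-0.9))
print(1 - d2(0) / alpha_hat(0))   # convergence rate at the local minimum 0
```

With these definitions `succ(5/3)` lands on the local maximum at $-1$ and `succ(2)` stops with an error, which are the two failure modes described above.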

# iterate the majorization update succ() ten times, starting from y = -0.9
y <- -.9

for (i in 1:10) {
  z <- succ(y)
  print(c(y, z))
  y <- z
}

## [1] -0.9000000000 -0.5786593707
## [1] -0.5786593707 -0.1599667718 
## [1] -0.1599667718 -0.0210900511 
## [1] -0.021090051099 -0.002190018272 
## [1] -0.002190018272 -0.000219866182 
## [1] -2.198661820e-04 -2.199532066e-05 
## [1] -2.199532066e-05 -2.199619150e-06 
## [1] -2.199619150e-06 -2.199627859e-07 
## [1] -2.199627859e-07 -2.199628730e-08 
## [1] -2.199628730e-08 -2.199628817e-09 

cobwebPlotter(-.9, succ, -1, .5, itmax = 100)

Now modify $f$ to an even biquadratic by leaving out the third order term, i.e. $f(x) = x^2 - \frac{1}{4}x^4$.
The convergence rate at zero is now 0, which indicates superlinear convergence. In fact the
algorithm converges to zero in a single step if started anywhere between the two local maxima
at $\pm\sqrt{2}$. The second support point for majorization at $y$ is always $-y$. The best quadratic
majorizer at $y = \pm\sqrt{2}$ is the horizontal line with function value identically equal to 1.


8.2 A Quartic with a Cubic Fan 

Alternatively, again for

$$f(x) = f(y) + f'(y)(x - y) + \frac{1}{2}f''(y)(x - y)^2 + \frac{1}{6}f'''(y)(x - y)^3 + \frac{1}{24}f^{iv}(x - y)^4$$

consider the additive cubic fan

$$g(x, \alpha) = f(y) + f'(y)(x - y) + \frac{1}{2}f''(y)(x - y)^2 + \frac{1}{6}\alpha|x - y|^3.$$

Note that $g(\bullet, \alpha)$ is two times continuously differentiable. Since $g''(x, \alpha) = f''(y) + \alpha|x - y|$ it
follows that if $f''(y) > 0$ and $\alpha > 0$ then $g(\bullet, \alpha)$ is convex.

For this example

$$\hat{\alpha} = \sup_{x \neq y} \frac{\frac{1}{6}f'''(y)(x - y)^3 + \frac{1}{24}f^{iv}(x - y)^4}{\frac{1}{6}|x - y|^3} = \sup_{x \neq y} \mathrm{sign}(x - y)\left(f'''(y) + \frac{1}{4}f^{iv}(x - y)\right).$$

If $f^{iv} > 0$ the fan does not majorize $f$ at $y$ because $\hat{\alpha} = +\infty$. If $f^{iv} < 0$ then $\hat{\alpha} = |f'''(y)|$,
and the supremum is approached in the limit $x \to y$. Note there is only a single support point in this case,
or, if you like, the second support point coincides with the first.

The majorization algorithm minimizes

$$g(x, \hat{\alpha}) = f(y) + f'(y)(x - y) + \frac{1}{2}f''(y)(x - y)^2 + \frac{1}{6}|f'''(y)||x - y|^3.$$

Minimization problems of this form are analyzed in an Appendix. If $y$ is close to a local
minimum we will have $f''(y) > 0$, and consequently $x - y = -b + \sqrt{b^2 - 2c}$ if $c < 0$ and
$x - y = b - \sqrt{b^2 + 2c}$ if $c > 0$, where

$$b(y) = \frac{f''(y)}{|f'''(y)|}, \qquad c(y) = \frac{f'(y)}{|f'''(y)|}.$$
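The cubic-fan update can be sketched in the same way. The function `succ_cubic` below is a hypothetical reconstruction, not the author's `succ`, and it assumes the even biquadratic $f(x) = x^2 - \frac{1}{4}x^4$ as the target. It divides the majorizer by $|f'''(y)|$ to obtain the coefficients $b(y) = f''(y)/|f'''(y)|$ and $c(y) = f'(y)/|f'''(y)|$ of the Appendix problem and then applies the closed-form minimizer, selecting the branch by the sign of $c$.

```r
# Derivatives of the assumed target f(x) = x^2 - x^4 / 4
d1 <- function(y) 2 * y - y^3    # f'(y)
d2 <- function(y) 2 - 3 * y^2    # f''(y)
d3 <- function(y) -6 * y         # f'''(y)

succ_cubic <- function(y) {
  a <- abs(d3(y))                # sharpest cubic fan parameter |f'''(y)|
  if (a == 0) stop("f'''(y) = 0: the cubic term vanishes")
  b  <- d2(y) / a                # b(y) = f''(y) / |f'''(y)|
  cc <- d1(y) / a                # c(y) = f'(y) / |f'''(y)|
  # minimizer of cc * t + b * t^2 / 2 + |t|^3 / 6, branch chosen by sign of cc
  step <- if (cc < 0) {
    -b + sqrt(b^2 - 2 * cc)
  } else if (cc > 0) {
    b - sqrt(b^2 + 2 * cc)
  } else {
    0
  }
  y + step
}

y1 <- succ_cubic(-0.9)
y2 <- succ_cubic(y1)
print(c(y1, y2))
```

The guard for $f'''(y) = 0$ plays the same role as a test for `y == 0` before calling the update, since for this target the third derivative vanishes only at zero.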


# iterate the cubic-fan update, skipping y = 0 where the update is not defined
y <- -.9

for (i in 1:10) {
  if (y == 0) next
  z <- succ(y)
  print(c(y, z))
  y <- z
}

## [1] -0.9000000000 -0.1855416182 
## [1] -0.18554161820 -0.00303932397 
## [1] -3.039323970e-03 -1.403766703e-08 
## [1] -1.403766703e-08 0.000000000e+00 
cobwebPlotter(-.9, succ, -1, 1, -.5, .5, itmax = 10)


8.3 The Logit 

Consider $f$ with $f(x) = -\log \pi(x)$, where

$$\pi(x) \stackrel{\Delta}{=} \frac{1}{1 + \exp(-x)}.$$

Because $\pi$ strictly increases from zero to one, $f$ strictly decreases from $+\infty$ to zero. From
$\pi'(x) = \pi(x)(1 - \pi(x))$ we see that

$$f'(x) = \pi(x) - 1.$$

Thus, as $x$ goes from $-\infty$ to $+\infty$, $f'$ strictly increases from $-1$ to $0$. It follows that $f$ is
strictly convex. Also

$$f''(x) = \pi'(x) = \pi(x)(1 - \pi(x)).$$

This implies

$$\inf_x f''(x) = 0,$$

as well as

$$\max_x f''(x) = f''(0) = \frac{1}{4}.$$

It also follows that the $r$-th order derivative is a polynomial of degree $r$ in $\pi(x)$. So if
$f^{(r)}(x) = \mathcal{P}_r(\pi(x))$ then, because $\pi$ is strictly monotone,

$$\inf_x \mathcal{P}_r(\pi(x)) = \min_{0 \leq z \leq 1} \mathcal{P}_r(z) \leq f^{(r)}(x) \leq \max_{0 \leq z \leq 1} \mathcal{P}_r(z) = \sup_x \mathcal{P}_r(\pi(x)).$$
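Because $f''(x) \leq \frac{1}{4}$ everywhere, Note 4 gives a quadratic majorizer of the logit target with $B = \frac{1}{4}$. The sketch below checks this on a grid; the support point $y = 1.5$, the grid on $[-8, 8]$, and all function names are arbitrary choices for illustration.

```r
# Logistic function and the target f(x) = -log(pi(x)), with its derivatives
pi_fun <- function(x) 1 / (1 + exp(-x))
f   <- function(x) -log(pi_fun(x))
df  <- function(x) pi_fun(x) - 1
d2f <- function(x) pi_fun(x) * (1 - pi_fun(x))

# quadratic majorizer from Note 4 with the uniform bound B = 1/4
g <- function(x, y) f(y) + df(y) * (x - y) + 0.125 * (x - y)^2

x <- seq(-8, 8, by = 0.01)
y <- 1.5
gap <- g(x, y) - f(x)      # should be non-negative, and zero at x = y
max_d2f <- max(d2f(x))     # should be close to f''(0) = 1/4
print(c(min_gap = min(gap), max_d2f = max_d2f))
```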
8.4 The Probit

Consider $f$ with $f(x) = -\log \Phi(x)$, where

$$\Phi(x) = \int_{-\infty}^{x} \phi(z)\,dz,$$

and

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}z^2\right\}.$$

Thus

$$f'(x) = -\frac{\phi(x)}{\Phi(x)} = \mathbf{E}(z \mid z < x),$$

where $z$ is a standard normal random variable. The connection between the derivative and
the conditional mean is due to Sampford (1953). Thus $f'$ is strictly increasing from $-\infty$ to

$$\lim_{x \to +\infty} f'(x) = \sup_x f'(x) = 0,$$

and $f$ is strictly convex. Also

$$f''(x) = x\frac{\phi(x)}{\Phi(x)} + \left(\frac{\phi(x)}{\Phi(x)}\right)^2 = 1 - \mathbf{V}(z \mid z < x),$$

which connects the second derivative to the conditional variance of a standard normal. This
implies that $f''(x) > 0$, and $f''$ strictly decreases with

$$\lim_{x \to -\infty} f''(x) = \sup_x f''(x) = 1,$$

and

$$\lim_{x \to +\infty} f''(x) = \inf_x f''(x) = 0,$$

which implies $f'$ is strictly concave.
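The probit formulas can be checked numerically with `pnorm` and `dnorm`. This is a sketch only: the grid on $[-5, 5]$ and the step `h` of the central difference are arbitrary choices, and the function names are illustrative.

```r
# Probit target f(x) = -log(Phi(x)); log.p = TRUE keeps the left tail accurate
f   <- function(x) -pnorm(x, log.p = TRUE)
df  <- function(x) -dnorm(x) / pnorm(x)     # f'(x) = -phi(x) / Phi(x)
d2f <- function(x) {
  r <- dnorm(x) / pnorm(x)
  x * r + r^2                               # f''(x) = x phi/Phi + (phi/Phi)^2
}

x <- seq(-5, 5, by = 0.1)

# compare the closed-form f' with a central difference approximation
h <- 1e-5
num_df <- (f(x + h) - f(x - h)) / (2 * h)
print(max(abs(num_df - df(x))))

# f'' should stay strictly between 0 and 1 and decrease in x
print(range(d2f(x)))
```

Since $0 < f''(x) < 1$, the quadratic majorizer of Note 4 with $B = 1$ applies to the probit as well.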


9 Multivariate Fans 


Now, of course, if we allow more general polynomials of the form

$$g(x) = f(y) + f'(y)(x - y) + \sum_{s=2}^{r+1} \frac{1}{s!}a_s(x - y)^s$$

we can presumably do better, in the sense that we can find smaller majorizations. But
the vectors $(a_2, \cdots, a_{r+1})$ make the problem multivariate, and do not allow us to define
improvement chains in any natural or obvious way.
9.0.1 Scaled Quadratic

$$g(x) = f(y) + (x - y)'\mathcal{D}f(y) + \frac{1}{2}\alpha(x - y)'\Sigma(x - y)$$


X)y)(x-y)i 

9.0.2 Shifted Newton 

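$$g(x) = f(y) + (x - y)'\mathcal{D}f(y) + \frac{1}{2}(x - y)'(\mathcal{D}^2 f(y) + \alpha\Sigma)(x - y)$$

$$\hat{\alpha} = \sup_x \left\{ \frac{f(x) - f(y) - (x - y)'\mathcal{D}f(y)}{\frac{1}{2}(x - y)'\Sigma(x - y)} - \frac{(x - y)'\mathcal{D}^2 f(y)(x - y)}{(x - y)'\Sigma(x - y)} \right\}$$

9.0.3 Nesterov-Polyak

$$g(x) = f(y) + (x - y)'\mathcal{D}f(y) + \frac{1}{2}(x - y)'\mathcal{D}^2 f(y)(x - y) + \frac{1}{6}\alpha\|x - y\|^3$$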

10 Appendix 

Consider the problem of finding all critical points (i.e. all points where the derivative vanishes)
of the function $f : \mathbb{R} \Rightarrow \mathbb{R}$ given by

$$f(x) = cx + \frac{1}{2}bx^2 + \frac{1}{6}|x|^3.$$

Note that $f$ is coercive, and consequently has a global minimum. Also note that
if $b > 0$ then $f$ is strictly convex, and the global minimum is the unique critical point.

The first and second derivative are

$$f'(x) = c + bx + \frac{1}{2}\mathrm{sign}(x)x^2,$$

and

$$f''(x) = b + |x|.$$

If $b$ is positive $f$ is strictly convex. Thus $f'$ is increasing. Since $f'(0) = c$ the unique solution
of $f'(x) = 0$, corresponding with the global minimum, is positive if $c < 0$ and negative if
$c > 0$. If $c < 0$ we have $x = -b + \sqrt{b^2 - 2c}$ and if $c > 0$ we have $x = b - \sqrt{b^2 + 2c}$. If $b > 0$
and $c = 0$ then $x = 0$. If $b = 0$ then $x = -\sqrt{2|c|}$ if $c > 0$ and $x = \sqrt{2|c|}$ if $c < 0$. If $b = c = 0$
then $x = 0$.
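For $b \geq 0$ the closed forms above translate directly into code. The sketch below uses illustrative names: `crit_convex` and `fprime` are not from the text, and the argument is called `cc` to avoid masking R's `c()`. It returns the unique critical point and verifies that the derivative vanishes there.

```r
# derivative of f(x) = cc * x + b * x^2 / 2 + |x|^3 / 6
fprime <- function(x, b, cc) cc + b * x + 0.5 * sign(x) * x^2

# unique critical point (the global minimum) when b >= 0
crit_convex <- function(b, cc) {
  stopifnot(b >= 0)
  if (cc < 0) {
    -b + sqrt(b^2 - 2 * cc)    # positive root
  } else if (cc > 0) {
    b - sqrt(b^2 + 2 * cc)     # negative root
  } else {
    0
  }
}

x1 <- crit_convex(1, -2)   # b > 0, c < 0
x2 <- crit_convex(1, 2)    # b > 0, c > 0
x3 <- crit_convex(0, 2)    # b = 0, equals -sqrt(2 |c|)
print(c(x1, x2, x3))
print(c(fprime(x1, 1, -2), fprime(x2, 1, 2), fprime(x3, 0, 2)))
```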


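For a quadratic fan with $q(x) = \frac{1}{2}(x - y)'\Sigma(x - y)$ we have

$$\hat{\alpha} = \sup_x \frac{f(x) - f(y) - (x - y)'\mathcal{D}f(y)}{\frac{1}{2}(x - y)'\Sigma(x - y)}.$$

By Taylor's theorem

$$f(x) - f(y) - (x - y)'\mathcal{D}f(y) \leq \sup_{0 \leq \lambda \leq 1} \frac{1}{2}(x - y)'\mathcal{D}^2 f(\lambda x + (1 - \lambda)y)(x - y),$$

and thus

$$\hat{\alpha} \leq \sup_{0 \leq \lambda \leq 1} \sup_x \frac{(x - y)'\mathcal{D}^2 f(\lambda x + (1 - \lambda)y)(x - y)}{(x - y)'\Sigma(x - y)}.$$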

If $b < 0$ the situation is more complicated. If $b < 0$ and $c = 0$ then $f'(x) = x(b + \frac{1}{2}|x|)$ and
thus $f'(x) = 0$ for $x = 0$, $x = 2b$, and $x = -2b$. $f$ has a local maximum equal to zero for $x = 0$
and two minima equal to $\frac{2}{3}b^3$ for $x = \pm 2b$.

Now consider the case that $b < 0$ and $c \neq 0$. We see that $f''(x) > 0$ for $x < b$ and $x > -b$,
while $f''(x) < 0$ for $b < x < -b$. Thus $f'(x)$ first increases from $-\infty$ to its maximum
$f'(b) = c + \frac{1}{2}b^2$, then decreases to its minimum $f'(-b) = c - \frac{1}{2}b^2$, and then increases again to
$+\infty$.

• If $c + \frac{1}{2}b^2 > c - \frac{1}{2}b^2 > 0$ then $f'(x) = 0$ has a unique solution $x < b < 0$. It is the global
minimum at $x = b - \sqrt{b^2 + 2c}$.

• If $0 > c + \frac{1}{2}b^2 > c - \frac{1}{2}b^2$ then $f'(x) = 0$ has a unique solution $x > -b > 0$. It is the
global minimum $x = -b + \sqrt{b^2 - 2c}$.

• If $c + \frac{1}{2}b^2 > 0 > c - \frac{1}{2}b^2$ then $f'(x) = 0$ has three solutions. There are two local minima
at $x = b - \sqrt{b^2 + 2c} < b < 0$ and $x = -b + \sqrt{b^2 - 2c} > -b > 0$, and one local maximum
strictly between $b$ and $-b$, at $x = -b - \sqrt{b^2 - 2c}$ if $c > 0$ and at $x = b + \sqrt{b^2 + 2c}$ if $c < 0$.
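The case analysis for $b < 0$ and $c \neq 0$ can be collected in one function. This is a sketch with illustrative names (`crit_points`, `fprime`, and `cc` for the coefficient $c$). For the middle root it uses $-b - \sqrt{b^2 - 2c}$ when $c > 0$ and $b + \sqrt{b^2 + 2c}$ when $c < 0$, because the local maximum lies on the positive side in the first case and on the negative side in the second.

```r
# derivative of f(x) = cc * x + b * x^2 / 2 + |x|^3 / 6
fprime <- function(x, b, cc) cc + b * x + 0.5 * sign(x) * x^2

# all critical points for b < 0 and cc != 0
crit_points <- function(b, cc) {
  stopifnot(b < 0, cc != 0)
  hi <- cc + 0.5 * b^2    # maximum of f', attained at x = b
  lo <- cc - 0.5 * b^2    # minimum of f', attained at x = -b
  if (lo > 0) {
    return(c(minimum = b - sqrt(b^2 + 2 * cc)))     # unique root, x < b
  }
  if (hi < 0) {
    return(c(minimum = -b + sqrt(b^2 - 2 * cc)))    # unique root, x > -b
  }
  left   <- b - sqrt(b^2 + 2 * cc)
  right  <- -b + sqrt(b^2 - 2 * cc)
  middle <- if (cc > 0) -b - sqrt(b^2 - 2 * cc) else b + sqrt(b^2 + 2 * cc)
  c(minimum = left, maximum = middle, minimum = right)
}

pts_neg <- crit_points(-1, -0.1)   # three critical points, c < 0
pts_pos <- crit_points(-1, 0.1)    # three critical points, c > 0
pts_one <- crit_points(-1, 2)      # a single critical point
print(pts_neg)
print(pts_pos)
print(pts_one)
```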


